C.1 Dataset description

For the WMT'16 English-German experiment, we used the same preprocessed data provided by [31] 1, including the same validation (newstest2013) and test (newstest2014) splits. The train, validation, and test splits contain 4,500,966, 3,000, and 3,003 sentence pairs, respectively. When using LayerDrop, we use a 50% drop probability. Similarly, we use beam search with beam size 5 and length penalty 1.0 for decoding. First, we show that adding the auxiliary loss LK discretizes the samples and achieves the pruning purpose by enforcing sparsity of the resulting model.
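The length-penalty setting above can be illustrated with a minimal sketch of how fairseq-style decoders rerank finished beam hypotheses: each candidate's summed log-probability is divided by length**lenpen, so lenpen = 1.0 amounts to per-token normalization. The function and variable names here are illustrative, not taken from the paper's implementation, and the beam search itself is omitted.

```python
import math

def length_normalized_score(logprob_sum: float, length: int, lenpen: float = 1.0) -> float:
    """Score a finished hypothesis by total log-probability divided by
    length**lenpen. With lenpen = 1.0 this is the mean per-token log-prob."""
    return logprob_sum / (length ** lenpen)

def pick_best(hypotheses, lenpen: float = 1.0):
    """Pick the hypothesis with the highest length-normalized score.
    `hypotheses` is a list of (tokens, logprob_sum) pairs."""
    return max(hypotheses, key=lambda h: length_normalized_score(h[1], len(h[0]), lenpen))

# Two illustrative candidates: the longer one has a lower raw log-probability
# (-4.5 < -2.0) but a higher per-token score (-1.5 > -2.0), so it wins.
hyps = [(["ja"], -2.0), (["ja", "gern", "."], -4.5)]
best_tokens, best_logprob = pick_best(hyps, lenpen=1.0)
```

Without the length penalty (i.e., ranking by raw logprob_sum), the shorter hypothesis would win, which is why a length penalty of 1.0 counteracts beam search's bias toward short outputs.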